Fifty Exercises in R¶

Data Visualization

數據交點 | 郭耀仁 yaojenkuo@datainpoint.com

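About data visualization¶

Why does data need to be visualized?¶

Visualization plays a central role in exploratory data analysis, because raw sequences of numbers and bare function definitions are very hard for people to interpret.

Abstract raw numeric data¶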

In [1]:
v <- rnorm(1000)
print(v[1:10])
 [1]  0.91203193  0.17795723  0.63266643  1.12816748  1.05346457  1.33972304
 [7] -0.52669067  0.08265482  1.50591506  1.72320542
In [2]:
hist(v)

An abstract function¶

\begin{equation} f(x) = \frac{1}{1 + e^{-x}} \end{equation}
In [3]:
x <- seq(from = -10, to = 10, length.out = 100)
f <- 1/(1 + exp(-x))
plot(x, f, type = "l")

Classic examples of data visualization¶

  • Charles Minard's map of Napoleon's disastrous Russian campaign of 1812.
  • Hans Rosling's 200 Countries, 200 Years, 4 Minutes.

Elements of a good visualization¶

  • Informative.
  • Concise.
  • Visually appealing.

Example dataset¶

Getting the gapminder example dataset¶

  • Install the gapminder package
  • Load the gapminder package

Installing the gapminder package¶

  • Through the Packages tab in RStudio
  • Through the install.packages() function
install.packages("gapminder")

Loading the gapminder package¶

  • Through the Packages tab in RStudio
  • Through the library() function
library("gapminder")

A first look at the gapminder example dataset¶

In [4]:
library("gapminder")

print(dim(gapminder))
[1] 1704    6
In [5]:
head(gapminder, 3)
A tibble: 3 × 6
country      continent  year   lifeExp  pop       gdpPercap
<fct>        <fct>      <int>  <dbl>    <int>     <dbl>
Afghanistan  Asia       1952   28.801   8425333   779.4453
Afghanistan  Asia       1957   30.332   9240934   820.8530
Afghanistan  Asia       1962   31.997   10267083  853.1007

How many countries and continents are in the gapminder example dataset?¶

In [7]:
print(length(unique(gapminder$country)))
print(unique(gapminder$continent))
[1] 142
[1] Asia     Europe   Africa   Americas Oceania 
Levels: Africa Americas Asia Europe Oceania
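The same question can be broken down by continent. The sketch below is one possible way to do it with dplyr; the name countries_per_continent is made up for this illustration and does not appear in the original notebook.

```r
library("gapminder")
library("dplyr")

# Count the distinct countries within each continent
countries_per_continent <- gapminder %>%
  group_by(continent) %>%
  summarise(n_countries = n_distinct(country), .groups = "drop")

print(countries_per_continent)

# The per-continent counts add up to the overall number of countries
sum(countries_per_continent$n_countries)
```

Passing .groups = "drop" returns an ungrouped tibble and avoids the informational message that summarise() otherwise prints.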

Which years does the gapminder example dataset cover?¶

In [8]:
print(unique(gapminder$year))
 [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007

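ggplot2 basics¶

What is ggplot2?¶

ggplot2 quickly won over data science teams with its concise syntax, flexibility, and attractive output. The gg in the name stands for grammar of graphics. The package authors are Hadley Wickham and Winston Chang, and its core idea is to explore data through a formal, structured grammar.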

Installing the ggplot2 package¶

  • Through the Packages tab in RStudio.
  • Through the install.packages() function.
install.packages("ggplot2")

Loading the ggplot2 package¶

  • Through the Packages tab in RStudio.
  • Through the library() function.
library("ggplot2")

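Basic chart types¶

  • Scatter plot: examine the correlation between variables.
  • Bar plot: compare ranks.
  • Histogram: examine the distribution of data.
  • Line plot: examine trends in values.
  • Boxplot: compare the distribution of data across categories.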

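How to build a ggplot2 graphic¶

  • Use the ggplot() function to map data to aesthetics.
  • Use a geom_*() function to choose the type of graphic.
  • Use + to chain functions together and stack layers.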
In [9]:
library("gapminder")               # data
library("ggplot2")                 # plotting
suppressMessages(library("dplyr")) # data manipulations
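The three steps listed above can be sketched as one incremental build. This is only an illustration on the gapminder data; the object names base, with_points, and with_smooth are made up for this example, and the regression layer is added purely to show a second layer being stacked.

```r
library("gapminder")
library("ggplot2")

# Step 1: ggplot() only maps columns to aesthetics; nothing is drawn yet
base <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

# Step 2: a geom_*() function chooses the type of graphic
with_points <- base + geom_point(alpha = 0.3)

# Step 3: + stacks another layer on top of the existing one
with_smooth <- with_points + geom_smooth(method = "lm", formula = y ~ x)

# Each + adds one entry to the plot's list of layers
length(base$layers)
length(with_points$layers)
length(with_smooth$layers)
```

Printing any of the three objects renders it, so intermediate steps can be inspected one at a time.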

Scatter plot: examining correlation¶

Use ggplot(aes(x, y)) + geom_point()

In [10]:
gapminder %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point()

Bar plot for comparing ranks: bar height is the number of observations¶

Use ggplot(aes(x)) + geom_bar()

In [11]:
gapminder %>% 
  filter(year == 2007) %>%
  ggplot(aes(x = continent)) +
    geom_bar()

Bar plot for comparing ranks: bar height is a summary value¶

Use ggplot(aes(x, y)) + geom_bar(stat = "identity")

In [12]:
gapminder %>% 
  filter(year == 2007) %>% 
  mutate(pop_numeric = as.numeric(pop)) %>%
  group_by(continent) %>% 
  summarise(ttl_pop = sum(pop_numeric)) %>% 
  ggplot(aes(x = continent, y = ttl_pop)) +
    geom_bar(stat = "identity")
`summarise()` ungrouping output (override with `.groups` argument)
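As a variation on the cell above, geom_col() is ggplot2's shorthand for geom_bar(stat = "identity"). The sketch below repeats the same summary with it; pop_2007 and p_col are names chosen for this illustration. The conversion with as.numeric() is kept because the continent totals can exceed R's 32-bit integer range, which is presumably why the original cell converts pop before summing.

```r
library("gapminder")
library("ggplot2")
library("dplyr")

# Total population per continent in 2007, summed as doubles to avoid integer overflow
pop_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(ttl_pop = sum(as.numeric(pop)), .groups = "drop")

# geom_col() draws bars whose heights are the y values themselves
p_col <- ggplot(pop_2007, aes(x = continent, y = ttl_pop)) +
  geom_col()
p_col
```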

Histogram: examining the distribution of data¶

Use ggplot(aes(x)) + geom_histogram()

In [13]:
gapminder %>% 
  ggplot(aes(x = gdpPercap)) +
    geom_histogram(bins = 40)

Line plot: examining trends in values¶

Use ggplot(aes(x, y)) + geom_line()

In [14]:
gapminder %>% 
  filter(country %in% c("Taiwan", "Japan", "China")) %>% 
  ggplot(aes(x = year, y = gdpPercap, color = country)) +
    geom_line()

Boxplot: comparing the distribution of data across categories¶

Use ggplot(aes(x, y)) + geom_boxplot()

In [15]:
gapminder %>% 
  ggplot(aes(x = continent, y = gdpPercap, color = continent)) +
    geom_boxplot()

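ggplot2 techniques¶

Commonly used ggplot2 techniques¶

  • Adding a title and axis labels.
  • Adding annotations.
  • Adding Chinese text (an issue macOS users run into).
  • Adjusting the axes.
  • Drawing multiple subplots on one canvas.

Adding a title and axis labels¶

Use the ggtitle() + xlab() + ylab() functions.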

In [16]:
gapminder %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point() +
    ggtitle("Wealth vs. Health") +
    xlab("GDP Per Capita") +
    ylab("Life Expectancy")
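The same labels can also be set in a single call with labs(), which some readers may find easier to scan. This is an alternative sketch, not the notebook's own approach; p_labs is a name chosen for this illustration.

```r
library("gapminder")
library("ggplot2")

# labs() sets the title and both axis labels in one place
p_labs <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  labs(
    title = "Wealth vs. Health",
    x = "GDP Per Capita",
    y = "Life Expectancy"
  )
p_labs
```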

Adding annotations¶

Use the geom_text() function.

In [17]:
n_obs <- gapminder %>% 
  group_by(continent) %>% 
  summarise(nrows = n())
n_obs %>% 
  ggplot(aes(x = continent, y = nrows)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = nrows, y = nrows), vjust = -1)
`summarise()` ungrouping output (override with `.groups` argument)

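Adding Chinese text (an issue macOS users run into)¶

Use theme(text = element_text(family = FONTS_SUPPORT_TC)), where FONTS_SUPPORT_TC stands for the name of a font that supports Traditional Chinese.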

In [18]:
p <- gapminder %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point() +
    ggtitle("財富與健康") +
    xlab("人均 GDP") +
    ylab("預期壽命") +
    theme(text = element_text(family = "Heiti TC Light"))
suppressWarnings(print(p))

Adjusting the axes¶

Use the scale_x_continuous() and scale_y_continuous() functions to adjust the axis limits and scale.

In [19]:
gapminder %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point(na.rm = TRUE) +
    scale_x_continuous(limits = c(0, 50000))
In [20]:
gapminder %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point() +
    scale_x_continuous(trans = "log10")
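Two related details are worth a short sketch. Limits set on a scale discard observations outside the range, which is why the first cell above passes na.rm = TRUE to silence the warning, whereas coord_cartesian() only zooms the view. In addition, scale_x_log10() is a shorthand for a log10-transformed continuous x scale. The object names base, clipped, zoomed, and logged are made up for this illustration.

```r
library("gapminder")
library("ggplot2")

base <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# Scale limits: values outside the range become NA before drawing,
# so those points are dropped (with a warning unless na.rm = TRUE is set)
clipped <- base + scale_x_continuous(limits = c(0, 50000))

# Coordinate limits: the view is zoomed, but no observation is discarded
zoomed <- base + coord_cartesian(xlim = c(0, 50000))

# Shorthand for a log10-transformed continuous x scale
logged <- base + scale_x_log10()
```

The difference matters most when a layer computes statistics, such as a smoother or a boxplot, because scale limits change the data those statistics see while coord_cartesian() does not.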

Drawing multiple subplots on one canvas¶

Use the facet_wrap(vars(CATEGORICAL_COLUMN)) function.

In [21]:
gapminder %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent)) +
    geom_point() +
    facet_wrap(vars(continent))

Visualizing real-world data with ggplot2¶

Source of the example data¶

COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University

In [22]:
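# The CSSE repository stopped being updated in March 2023, so a date computed
# from Sys.Date() no longer points at an existing file. The default date below
# is an assumption chosen to match the Last_Update values in the output shown.
get_daily_report <- function(report_date = as.Date("2021-08-31")) {
    file_date <- format(report_date, "%m-%d-%Y")
    csv_url <- paste0("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/",
                      "csse_covid_19_daily_reports/",
                      file_date,
                      ".csv"
                     )
    daily_report <- read.csv(csv_url)
    return(daily_report)
}
daily_report <- get_daily_report()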
In [23]:
head(daily_report)
A data.frame: 6 × 14
   FIPS   Admin2  Province_State  Country_Region       Last_Update          Lat        Long_      Confirmed  Deaths  Recovered  Active  Combined_Key         Incident_Rate  Case_Fatality_Ratio
   <int>  <chr>   <chr>           <chr>                <chr>                <dbl>      <dbl>      <int>      <int>   <lgl>      <lgl>   <chr>                <dbl>          <dbl>
1  NA                             Afghanistan          2021-09-01 04:21:38  33.93911   67.70995   153220     7118    NA         NA      Afghanistan          393.5950       4.6456076
2  NA                             Albania              2021-09-01 04:21:38  41.15330   20.16830   146387     2498    NA         NA      Albania              5086.7677      1.7064357
3  NA                             Algeria              2021-09-01 04:21:38  28.03390   1.65960    196080     5269    NA         NA      Algeria              447.1501       2.6871685
4  NA                             Andorra              2021-09-01 04:21:38  42.50630   1.52180    15033      130     NA         NA      Andorra              19456.4162     0.8647642
5  NA                             Angola               2021-09-01 04:21:38  -11.20270  17.87390   47544      1217    NA         NA      Angola               144.6590       2.5597341
6  NA                             Antigua and Barbuda  2021-09-01 04:21:38  17.06080   -61.79640  1715       44      NA         NA      Antigua and Barbuda  1751.2867      2.5655977

Visualizing the ten countries with the most confirmed cases as a bar plot¶

In [24]:
confirmed_by_countries <- daily_report %>% 
    group_by(Country_Region) %>% 
    summarise(Confirmed = sum(Confirmed)) %>% 
    arrange(desc(Confirmed))
head(confirmed_by_countries)
`summarise()` ungrouping output (override with `.groups` argument)

A tibble: 6 × 2
Country_Region  Confirmed
<chr>           <int>
US              39198131
India           32810845
Brazil          20776870
France          6834930
United Kingdom  6821356
Russia          6820697
In [25]:
top_ten_countries <- rev(confirmed_by_countries$Country_Region[1:10])
p <- confirmed_by_countries %>%
    head(10) %>% 
    mutate(Country_Region=factor(Country_Region, levels=top_ten_countries)) %>% 
    ggplot(aes(x = Country_Region, y = Confirmed)) +
    geom_bar(stat = "identity") + 
    coord_flip()
In [26]:
p
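An alternative to building the factor levels by hand is to let reorder() order the categories by the plotted value. The sketch below uses a small made-up data frame (toy, with hypothetical countries A, B, and C) so that it runs without downloading anything.

```r
library("ggplot2")

# Hypothetical counts, used only to illustrate the ordering
toy <- data.frame(
  country = c("A", "B", "C"),
  confirmed = c(30, 10, 20)
)

# reorder() sorts the levels of country by ascending confirmed
ordered_levels <- levels(reorder(toy$country, toy$confirmed))
ordered_levels

# After coord_flip() the last level is drawn at the top, so the largest bar comes first
p_reorder <- ggplot(toy, aes(x = reorder(country, confirmed), y = confirmed)) +
  geom_col() +
  coord_flip() +
  xlab("country")
p_reorder
```

This keeps the ordering tied to the data, so the plot stays correctly sorted if the underlying counts change.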
In [27]:
top_ten_countries <- rev(confirmed_by_countries$Country_Region[1:10])

p <- confirmed_by_countries %>%
    head(10) %>% 
    mutate(Country_Region = factor(Country_Region, levels = top_ten_countries)) %>% 
    ggplot(aes(x = Country_Region, y = Confirmed)) +
    geom_bar(stat = "identity") + 
    # label each bar with its own Confirmed value from the plotted data
    geom_text(aes(label = Confirmed), hjust = -0.1) +
    scale_y_continuous(limits = c(0, 50000000)) +
    coord_flip()
In [28]:
p

Visualizing the trend in confirmed cases as a line plot¶

In [29]:
library("tidyr")

get_time_series_confirmed <- function() {
    csv_url <- paste0("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/",
                      "csse_covid_19_data/csse_covid_19_time_series/",
                      "time_series_covid19_confirmed_global.csv")
    time_series_confirmed <- read.csv(csv_url)
    cols_to_pivot_longer <- colnames(time_series_confirmed)[5:ncol(time_series_confirmed)]
    time_series_confirmed_long <- time_series_confirmed[, c(2, 5:ncol(time_series_confirmed))] %>% 
        pivot_longer(cols = all_of(cols_to_pivot_longer),
                     names_to = "Date",
                     values_to = "Confirmed"
                    )
    time_series_confirmed_long <- time_series_confirmed_long %>% 
        group_by(Country.Region, Date) %>% 
        summarise(Confirmed = sum(Confirmed))
    time_series_confirmed_long$Date <- time_series_confirmed_long$Date %>% 
        sub(pattern = "X", replacement = "") %>% 
        gsub(pattern = ".", replacement = "-", fixed = TRUE) %>% 
        as.Date("%m-%d-%y")
    return(time_series_confirmed_long)
}
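The date handling inside the function can be checked in isolation. The sketch below assumes column names of the form X1.22.20, which is what read.csv() produces from headers such as 1/22/20, and parses them with a dotted format string instead of first replacing the dots. The vector raw_names is made up for this illustration.

```r
# Column names in the shape read.csv() gives them (hypothetical sample)
raw_names <- c("X1.22.20", "X12.31.20", "X9.1.21")

# Strip the leading X, then parse month.day.two-digit-year
parsed <- as.Date(sub("^X", "", raw_names), format = "%m.%d.%y")
parsed
```

Because summarise() runs while Date is still text, the rows come back ordered by the text value rather than chronologically, which matches the printed result where 2021-01-10 follows 2021-01-01. Sorting with arrange(Date) after the conversion restores chronological order.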
In [30]:
time_series_confirmed <- get_time_series_confirmed()
time_series_confirmed
`summarise()` regrouping output by 'Country.Region' (override with `.groups` argument)

A grouped_df: 114855 × 3
Country.Region  Date        Confirmed
<chr>           <date>      <int>
Afghanistan     2021-01-01  51526
Afghanistan     2021-01-10  53489
Afghanistan     2021-01-11  53538
Afghanistan     2021-01-12  53584
Afghanistan     2021-01-13  53584
Afghanistan     2021-01-14  53775
Afghanistan     2021-01-15  53831
Afghanistan     2021-01-16  53938
Afghanistan     2021-01-17  53984
Afghanistan     2021-01-18  54062
Afghanistan     2021-01-19  54141
Afghanistan     2021-01-02  51526
Afghanistan     2021-01-20  54278
Afghanistan     2021-01-21  54403
Afghanistan     2020-01-22  0
Afghanistan     2021-01-22  54483
Afghanistan     2020-01-23  0
Afghanistan     2021-01-23  54559
Afghanistan     2020-01-24  0
Afghanistan     2021-01-24  54595
Afghanistan     2020-01-25  0
Afghanistan     2021-01-25  54672
Afghanistan     2020-01-26  0
Afghanistan     2021-01-26  54750
Afghanistan     2020-01-27  0
Afghanistan     2021-01-27  54854
Afghanistan     2020-01-28  0
Afghanistan     2021-01-28  54891
Afghanistan     2020-01-29  0
Afghanistan     2021-01-29  54939
⋮               ⋮           ⋮
Zimbabwe        2021-09-01  124960
Zimbabwe        2020-09-10  7453
Zimbabwe        2020-09-11  7479
Zimbabwe        2020-09-12  7508
Zimbabwe        2020-09-13  7526
Zimbabwe        2020-09-14  7531
Zimbabwe        2020-09-15  7576
Zimbabwe        2020-09-16  7598
Zimbabwe        2020-09-17  7633
Zimbabwe        2020-09-18  7647
Zimbabwe        2020-09-19  7672
Zimbabwe        2020-09-02  6638
Zimbabwe        2020-09-20  7683
Zimbabwe        2020-09-21  7683
Zimbabwe        2020-09-22  7711
Zimbabwe        2020-09-23  7725
Zimbabwe        2020-09-24  7752
Zimbabwe        2020-09-25  7787
Zimbabwe        2020-09-26  7803
Zimbabwe        2020-09-27  7812
Zimbabwe        2020-09-28  7816
Zimbabwe        2020-09-29  7837
Zimbabwe        2020-09-03  6678
Zimbabwe        2020-09-30  7838
Zimbabwe        2020-09-04  6837
Zimbabwe        2020-09-05  6837
Zimbabwe        2020-09-06  6837
Zimbabwe        2020-09-07  7298
Zimbabwe        2020-09-08  7388
Zimbabwe        2020-09-09  7429
In [31]:
p <- time_series_confirmed %>% 
    filter(Country.Region == "Taiwan*") %>% 
    ggplot(aes(x = Date, y = Confirmed)) + 
    geom_line()
p
In [32]:
p <- time_series_confirmed %>% 
    filter(Country.Region %in% c("Taiwan*", "China", "Japan", "Korea, South", "Singapore")) %>% 
    ggplot(aes(x = Date, y = Confirmed, colour = Country.Region)) + 
    geom_line()
p

Visualizing the trend in daily new confirmed cases as a bar plot¶

In [33]:
time_series_confirmed <- time_series_confirmed %>% 
    filter(Country.Region == "Taiwan*") %>% 
    arrange(Date)
confirmed_lag <- time_series_confirmed$Confirmed %>% 
    lag()
daily_increase <- time_series_confirmed$Confirmed - confirmed_lag
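The subtraction above turns a cumulative series into daily increases. A minimal sketch on the first seven cumulative values for Taiwan from the result shown in this section illustrates the idea, along with an equivalent base R form using diff(). The names cumulative, by_lag, and by_diff are chosen for this illustration.

```r
library("dplyr")

# First seven cumulative confirmed counts for Taiwan, as shown in this section
cumulative <- c(1, 1, 3, 3, 4, 5, 8)

# dplyr::lag() shifts the vector by one position and pads the front with NA
by_lag <- cumulative - dplyr::lag(cumulative)

# diff() returns the successive differences; the leading NA is added by hand
by_diff <- c(NA, diff(cumulative))

by_lag
by_diff
```

Writing dplyr::lag() explicitly guards against accidentally calling stats::lag(), which is meant for time series objects and does not shift a plain vector.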
In [34]:
time_series_confirmed$Daily_Increase <- daily_increase
time_series_confirmed
A grouped_df: 589 × 4
Country.Region  Date        Confirmed  Daily_Increase
<chr>           <date>      <int>      <int>
Taiwan*         2020-01-22  1          NA
Taiwan*         2020-01-23  1          0
Taiwan*         2020-01-24  3          2
Taiwan*         2020-01-25  3          0
Taiwan*         2020-01-26  4          1
Taiwan*         2020-01-27  5          1
Taiwan*         2020-01-28  8          3
Taiwan*         2020-01-29  8          0
Taiwan*         2020-01-30  9          1
Taiwan*         2020-01-31  10         1
Taiwan*         2020-02-01  10         0
Taiwan*         2020-02-02  10         0
Taiwan*         2020-02-03  10         0
Taiwan*         2020-02-04  11         1
Taiwan*         2020-02-05  11         0
Taiwan*         2020-02-06  16         5
Taiwan*         2020-02-07  16         0
Taiwan*         2020-02-08  17         1
Taiwan*         2020-02-09  18         1
Taiwan*         2020-02-10  18         0
Taiwan*         2020-02-11  18         0
Taiwan*         2020-02-12  18         0
Taiwan*         2020-02-13  18         0
Taiwan*         2020-02-14  18         0
Taiwan*         2020-02-15  18         0
Taiwan*         2020-02-16  20         2
Taiwan*         2020-02-17  22         2
Taiwan*         2020-02-18  22         0
Taiwan*         2020-02-19  23         1
Taiwan*         2020-02-20  24         1
⋮               ⋮           ⋮          ⋮
Taiwan*         2021-08-03  15721      19
Taiwan*         2021-08-04  15742      21
Taiwan*         2021-08-05  15753      11
Taiwan*         2021-08-06  15765      12
Taiwan*         2021-08-07  15775      10
Taiwan*         2021-08-08  15782      7
Taiwan*         2021-08-09  15790      8
Taiwan*         2021-08-10  15798      8
Taiwan*         2021-08-11  15814      16
Taiwan*         2021-08-12  15820      6
Taiwan*         2021-08-13  15836      16
Taiwan*         2021-08-14  15843      7
Taiwan*         2021-08-15  15852      9
Taiwan*         2021-08-16  15862      10
Taiwan*         2021-08-17  15880      18
Taiwan*         2021-08-18  15891      11
Taiwan*         2021-08-19  15897      6
Taiwan*         2021-08-20  15906      9
Taiwan*         2021-08-21  15916      10
Taiwan*         2021-08-22  15926      10
Taiwan*         2021-08-23  15932      6
Taiwan*         2021-08-24  15938      6
Taiwan*         2021-08-25  15939      1
Taiwan*         2021-08-26  15947      8
Taiwan*         2021-08-27  15954      7
Taiwan*         2021-08-28  15960      6
Taiwan*         2021-08-29  15983      23
Taiwan*         2021-08-30  15991      8
Taiwan*         2021-08-31  15995      4
Taiwan*         2021-09-01  16001      6
In [35]:
p <- time_series_confirmed %>% 
    filter(!is.na(Daily_Increase)) %>% 
    ggplot(aes(x = Date, y = Daily_Increase)) +
    geom_bar(stat = "identity", na.rm = TRUE)
p

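Recreating Gapminder with plotly¶

About the plotly package¶

It lets R users build interactive charts with D3.js and WebGL capabilities without having to learn JavaScript.

Installing the plotly package¶

  • Through the Packages tab in RStudio
  • Through the install.packages() function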
install.packages("plotly")

Loading the plotly package¶

  • Through the Packages tab in RStudio
  • Through the library() function
library("plotly")
In [36]:
suppressMessages(library("plotly"))
radius <- sqrt((gapminder$pop)/pi)

p <- gapminder %>%
  plot_ly(
    x = ~gdpPercap, 
    y = ~lifeExp, 
    size = ~pop, 
    color = ~continent, 
    frame = ~year, 
    text = ~country, 
    fill = ~'',
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers',
    sizes = c(min(radius), max(radius))
  ) %>%
  layout(
    xaxis = list(
      type = "log"
    )
  )
In [37]:
p